Fast and linearizable concurrent priority queue via dynamic aggregation of operations

ABSTRACT

Embodiments of the invention improve parallel performance in multi-threaded applications by serializing concurrent priority queue operations in a way that improves throughput. An embodiment uses a synchronization protocol and aggregation technique that enables a single thread to handle multiple operations in a cache-friendly fashion while threads awaiting the completion of those operations spin-wait on a local stack variable, i.e., each thread continues to poll the stack variable until it has been set or cleared appropriately, rather than rely on an interrupt notification. A technique for an enqueue/dequeue (push/pop) optimization uses re-ordering of aggregated operations to enable the execution of two operations for the price of one in some cases. Other embodiments are described and claimed.

FIELD OF THE INVENTION

An embodiment of the present invention relates generally to dynamic aggregation and optimization of concurrent entities in a computing system and, more specifically, to dynamic aggregation of operations to accelerate accessing and modifying a list, priority queue, or other concurrent entity, concurrently in a scalable and linearizable fashion, and the use of the list to optimize operations on a concurrent entity such as a priority queue.

BACKGROUND INFORMATION

Various mechanisms exist for accessing and modifying priority queues. Priority queues are a modified queue construct. In a first-in-first-out (FIFO) queue, for instance, the first element to be placed on the queue (enqueued or pushed) is the first to be removed (dequeued or popped). In a priority queue, the first element to be removed (popped) is the element in the queue with the highest priority at the time. Thus, the priority queue may have an underlying representation that favors the ordered dequeuing of elements based on a user-defined priority on those elements.
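
The difference is easy to demonstrate with the standard C++ containers. The following minimal sketch (the values are illustrative only) contrasts FIFO removal order with priority-ordered removal:

```cpp
#include <iostream>
#include <queue>

int main() {
    // FIFO queue: elements are removed in arrival order.
    std::queue<int> fifo;
    fifo.push(3); fifo.push(42); fifo.push(7);
    std::cout << fifo.front() << '\n';   // 3 -- the first element enqueued

    // Priority queue: the highest-priority element is removed first,
    // regardless of arrival order (std::priority_queue is a max-heap).
    std::priority_queue<int> pq;
    pq.push(3); pq.push(42); pq.push(7);
    std::cout << pq.top() << '\n';       // 42 -- the highest-priority element
    return 0;
}
```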

Priority queues are often used for storage of information or tasks in multi-threaded computing systems. However, in a multi-threaded system, more than one thread may try to access the queue at the same time. Thus, existing systems may implement a series of locking algorithms to ensure that there is no contention for adding and removing items, and to maintain the underlying representation of the priority queue. The priority queue is often maintained with a heap data structure, for ease of locating the highest priority element and restructuring after additions to and removals from the heap.

Many algorithms for implementing concurrent priority queues are based on mutual exclusion (locking). However, mutual exclusion algorithms cause blocking, which may degrade overall performance of the system. Further, thread locking typically allows only one thread to operate at a time, which reduces this method to serial performance with the addition of locking overhead. In some cases, the overhead of this method may lead to worse than serial performance. Another type of solution attempts to perform operations concurrently within the queue itself. In this case, multiple threads operate on the queue at the same time, so access to each element of the queue must be protected. This may use internal locking or another system to prevent races on specific queue elements. Such algorithms improve concurrency, but many of them are too complex and suffer high overhead from the internal locking, and so do not perform well in practice.

BRIEF DESCRIPTION OF THE DRAWINGS

The features and advantages of the present invention will become apparent from the following detailed description of the present invention in which:

FIG. 1 illustrates a representation of an example pending_operations list having items to be put on the priority queue, and also shows the timelines of the three threads that are waiting on the operations in the pending_operations list, according to an embodiment of the invention;

FIG. 2 illustrates the array to hold a priority queue, and how the underlying heap representation of the priority queue maps onto the array, according to an embodiment of the invention;

FIG. 3 illustrates a representation of a priority queue with heap-sorted items and unsorted items, according to an embodiment of the invention;

FIGS. 4-8 show an example of the enqueue and dequeue operations performed by the handler on a concurrent priority queue, according to an embodiment of the invention;

FIG. 9 is an illustration of an example dequeue list, or pop_list, according to an embodiment of the invention;

FIG. 10 is a flow chart illustrating a method for integrating synchronization and aggregation and enqueue/dequeue optimization for pushing and popping items to/from a concurrent priority queue, according to an embodiment of the invention;

FIG. 11 shows various throughput measures for embodiments of the invention, as compared to benchmark metrics; and

FIG. 12 is a block diagram of an example computing system on which embodiments of the invention may be implemented.

DETAILED DESCRIPTION

Embodiments of the invention improve parallel performance by serializing concurrent priority queue operations in a way that improves throughput. It will be understood that while priority queues are used to illustrate an example embodiment of the invention, other embodiments may operate on other types of concurrent entities, or concurrent data structures. An embodiment has two components: (1) a synchronization protocol and aggregation technique that enables a single thread to handle multiple operations in a cache-friendly fashion while threads awaiting the completion of those operations spin-wait on a local stack variable, i.e., the thread continues to poll the stack variable until it has been set or cleared appropriately, rather than rely on an interrupt notification; and (2) an enqueue/dequeue (push/pop) optimization that uses re-ordering of aggregated operations to enable the execution of two operations for the price of one.

Reference in the specification to “one embodiment” or “an embodiment” of the present invention means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrase “in one embodiment” in various places throughout the specification are not necessarily all referring to the same embodiment.

For purposes of explanation, specific configurations and details are set forth in order to provide a thorough understanding of the present invention. However, it will be apparent to one of ordinary skill in the art that embodiments of the present invention may be practiced without the specific details presented herein. Furthermore, well-known features may be omitted or simplified in order not to obscure the present invention. Various examples may be given throughout this description. These are merely descriptions of specific embodiments of the invention. The scope of the invention is not limited to the examples given.

In an example system, a priority queue may be used to schedule graphics operations for a graphical user interface (GUI). In a GUI system, there may be a large number of inputs arriving from various sources. Some inputs may be more important than others. The inputs, or desired operations, will be received, and the process that is to perform the operations will place the items into the queue in priority order, and then process them by priority.

In another example system, priority queues may be used for parallel discrete event simulation. Modeling network traffic is an example of this type of simulation. In this type of application, the priority may be a timestamp for when an event must occur. It will be understood by those of skill in the art that priority queues may be implemented in a variety of applications, and that optimizing the queue operations may result in significantly enhanced performance of the system.

It will be understood that embodiments of the invention may be used for a wide variety of applications. Priority queue inputs may be generated/received on various processes, which in turn will place the inputs in the queue. It will be understood that embodiments of the invention operate with multiple processes adding items to the one queue concurrently. One or more other processes concurrently remove the items from the priority queue to process them. The queue itself, via defined push/enqueue and pop/dequeue operations, organizes the inputs into some internal data structure and answers requests for items by removing them in priority order.

An embodiment of the invention integrates the two components (synchronization and aggregation; and enqueue/dequeue optimization) to provide a system for maintaining a concurrent priority queue that minimizes cache misses and speeds up heap sorting of the priority queue. Each component utilizes specific data structures and techniques to aid in implementation. In an example embodiment, the following terms or structures may be used, as defined below, although it will be apparent to those of skill in the art that alternative data structures may be used while still enjoying advantages of embodiments of the invention.

-   pending_operations list: a pointer to a linked list of operations (e.g., push and pop in the case of a concurrent priority queue) to be performed by a thread.
-   op_list: a pointer to the original pending_operations list at the point when a handler thread takes control of the list to perform the pending operations on the priority queue.
-   priority queue: the actual internal data structure which stores the items that are pushed into the concurrent priority queue. The data structure may be an array, in heap representation, with the highest priority item being at the front of the array (i.e., the top of the heap).
-   active handler: the first thread to place an item on the pending_operations list becomes the active handler when that list is ready to be acted upon.
-   waiting handler: when a thread is the first to place an item on the pending_operations list, it becomes the waiting handler, awaiting a flag (the handler_busy flag) to be reset to FALSE.
-   cpq_operation: the thread operation, or operation node, that is to be put on the pending_operations list, representing an operation (e.g., push or pop) that is to be applied to the priority queue.
-   handler_busy flag: when a waiting handler becomes an active handler, it sets this flag to TRUE to prevent another thread from trying to become handler. When the active handler is finished with the appointed tasks, it resets this flag to FALSE, and resumes normal operations.
-   pop_list: a list of cpq_operations representing dequeue operations only, to be applied to the priority queue.
-   compare-and-swap (CAS): an atomic operation used in embodiments of the invention to put an item into a linked list where multiple threads may be trying to modify the list simultaneously. The atomic nature of this instruction obviates the need to lock the list from other threads.
-   fetch-and-store: an atomic operation used in embodiments of the invention to grab a list and replace it with a null list, usually implemented by changing pointers to the lists.
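
As one minimal C++ sketch of how these structures might be declared (the field names and types are illustrative assumptions, not the claimed layout):

```cpp
#include <atomic>

// One operation node per requesting thread; in an embodiment this lives
// on the requesting thread's stack (the op_info variable of FIG. 10).
struct cpq_operation {
    enum Type { PUSH, POP } type;    // which queue operation is requested
    int value;                       // item to push, or slot for the popped item
    bool success;                    // result status for pop operations
    std::atomic<bool> ready{false};  // set by the handler when the operation completes
    cpq_operation* next{nullptr};    // link in pending_operations or op_list
};

// Queue-global shared state.
std::atomic<cpq_operation*> pending_operations{nullptr}; // list being built
std::atomic<bool> handler_busy{false};                   // TRUE while a handler is active
```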

The synchronization and aggregation component of an embodiment utilizes the handler_busy flag, the pending_operations list, and the op_list. Operations on the priority queue will be discussed in conjunction with the second component. It will be understood that each component, as discussed herein, may be implemented in a system without implementing the other component, e.g., the components may stand alone. Embodiments of the invention, however, may utilize both components in an integrated fashion.

Embodiments of the invention allow each thread to schedule its push and pop operations by using a list known herein as the pending_operations list. While the terms push and pop sometimes connote a stack (LIFO) data structure, they are used herein as synonyms for enqueue and dequeue, for simplicity. The operations stored in the pending_operations list are later performed by a handler thread on the internal data structure that represents the priority queue, and the thread requesting the queue operation waits until it receives notification that its operation has been completed. Thus, since only one thread handles the pending operations to modify the internal priority queue, no locking mechanism is needed to prevent thread contention on the internal priority queue. As the handler performs operations and notifies a waiting thread that its operation has been completed, that thread may go on to request another operation on the concurrent priority queue. However, this new operation will be placed on the next pending_operations list, and not the one currently being operated upon (op_list), because once a handler takes control of the pending_operations list, it atomically fetches the list (fetch-and-store) into op_list and stores a NULL pointer in pending_operations (to signify that the list is empty) for subsequent threads to add operations to. Each pending_operations list is handled in turn in this way, with each instance that is stored in op_list containing no more than P operations, where P is the number of threads running. Thus, no thread must wait for more than P operations before its operation is completed.

Because each thread waits until its operation is completed before being able to schedule another operation, there may be only a total of P operations outstanding at once, and only two lists are necessary: the active list (op_list) being operated on by the handler thread, and the waiting list (pending_operations) where threads continue to add items. For instance, the active list may have n items, where 0 ≤ n ≤ P, and the waiting list may have m items, where 0 ≤ m ≤ P − n.

Referring now to FIG. 1, there is shown a representation of an example pending_operations list having items to be put on the priority queue. It will be understood that the terms push and pop may be used interchangeably with the terms enqueue and dequeue, respectively, with no loss of generality or specificity in the description herein. In an embodiment of the invention, a pending_operations list may be used. As threads attempt to enqueue items to and dequeue items from the priority queue, they instead add a record of the desired operation to the pending_operations list.

In this example, there are three threads, T1 (101), T2 (102) and T3 (103). It will be understood that there may be 1 to P threads operating at the same time. Each thread is to add an operation to the pending_operations list 110, as necessary. However, a thread may only add one operation to the list, because the thread is then put into a spin-wait until that operation is actually performed on the priority queue. Threads may attempt to concurrently insert push and pop operations onto the pending_operations list 110. In an embodiment, operations on the list are concurrent priority queue operations, or cpq_operations. To add an operation to the list, which is typically a linked list, an atomic compare-and-swap (CAS) is used. The thread uses the atomic CAS operation to add a cpq_operation, or operation node, to the front of the list 110, in order to ensure that another thread has not modified the list pointers while it was preparing to add its operation. If the CAS operation fails, it means another thread added an item to the list first, and the next pointers are now different. In this case, the thread will rebuild the pointers in its operation node and attempt the CAS again, until the operation has been successfully added to the list.
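
A sketch of this retry loop, using the structures assumed above (std::atomic's compare_exchange serves as the CAS; the function name is hypothetical):

```cpp
// Prepend op to pending_operations with compare-and-swap. Returns true if
// op became the first node on the list, i.e., this thread is the next handler.
bool add_to_pending(cpq_operation* op) {
    cpq_operation* head = pending_operations.load();
    do {
        op->next = head;  // rebuild the next pointer after each failed CAS
    } while (!pending_operations.compare_exchange_weak(head, op));
    return op->next == nullptr;  // list was empty => first node => handler role
}
```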

In an embodiment of the invention, the first thread to place a cpq_operation on the pending_operations list becomes the handler for that list. Other threads that have placed an operation node on the list will wait on a ready flag before being permitted to continue normal operations. In this example, T1 101 is to push item 42 onto the priority queue, T2 102 wants to pop an item from the priority queue, and T3 103 is to push item 27 onto the priority queue. In the example of FIG. 1, thread T2 102 was the first to successfully add an operation to pending_operations 110 (its next pointer is NULL), and thus T2 will eventually become the handler for the pending_operations list.

According to embodiments of the invention, there is at most one active handler (processing op_list) and at most one waiting handler (waiting to process pending_operations) at a time. A waiting handler spin-waits on a global handler_busy flag. When the handler_busy flag becomes FALSE, the waiting handler sets the flag to TRUE and becomes the active handler. The active handler then uses the fetch-and-store to move pending_operations to op_list. When the active handler is finished handling operations on the op_list, to be discussed more fully below, it sets handler_busy to FALSE. When a handler becomes active, it atomically points op_list at the pending_operations list and points pending_operations to NULL so that subsequently added operations form a new list. It will be understood that these grab-and-replace operations may be performed by changing the pointers which identify the various lists; the elements of these lists need not be fully copied to a new list. The op_list represents an aggregation of concurrent operations to be performed on the priority queue that may be serially processed in an optimal fashion. The active handler will execute the operations in the op_list, as discussed for the enqueue/dequeue optimization component below, and new operations to be handled by the waiting handler are put into the new pending_operations list.
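
A sketch of the handler hand-off under the assumptions above; the description notes the flag need not be atomic since at most one thread waits on it at a time, but std::atomic is used here so the sketch is well-defined under the C++ memory model:

```cpp
// Waiting handler: spin until the current active handler finishes,
// claim the handler_busy flag, then grab the aggregated list.
cpq_operation* become_active_handler() {
    bool expected = false;
    while (!handler_busy.compare_exchange_weak(expected, true))
        expected = false;  // spin-wait until handler_busy becomes FALSE
    // fetch-and-store: take the list, leave NULL behind so that
    // subsequently added operations form a new pending_operations list.
    return pending_operations.exchange(nullptr);  // this becomes op_list
}
```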

In an embodiment, the priority queue is represented by an array-based heap, as shown in FIG. 2. In this embodiment, the queue is stored as an array 201, where the numbers indicate the position in the array. The priority queue is represented as a heap 203, where the numbers in the blocks correspond to the item's location in the array 201. It will be understood that there are various ways of implementing queues and heaps using data structures available in the computing sciences. The description herein is meant to be illustrative and not limiting as to deployment and implementation. In an embodiment, storing the priority queue in an array necessitates maintaining a balanced heap. Thus, pushing items onto the bottom of the heap helps maintain the balance before re-sorting the heap, also known as heap-sorting. While the heap could be stored in a linked list in some implementations, storing the heap in an array produces less overhead.
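
For such an array layout, the usual implicit-heap index arithmetic applies (this 0-based convention is a standard one, sketched here for reference rather than taken from the figure):

```cpp
// Implicit binary heap stored in an array: element 0 is the top of the heap.
inline int parent(int i)      { return (i - 1) / 2; }
inline int left_child(int i)  { return 2 * i + 1; }
inline int right_child(int i) { return 2 * i + 2; }
```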

Now referring to FIG. 3, there is shown a representation of the priority queue 300, according to an embodiment of the invention. In this queue, there are items which have been sorted into the prioritized heap 301, and also unsorted items 303 which have been pushed onto the queue but not yet sorted. Various attributes of the queue are identified, such as the top 311, a mark 313 for the beginning of the unsorted items, the size 315 of the queue, and the capacity 317 of the queue. In existing systems, items are typically sorted immediately when placed on the priority queue, which requires on the order of O(log n) operations for each item added to the heap, where log n is the height of the heap and n is the number of items in the heap. However, embodiments of the invention optimize sort operations on the queue by postponing the sorting for some items 303.

The active handler examines each cpq_operation of the waiting threads in the op_list in turn. When the operation is a push, or enqueue, the item to be pushed is copied from the operation node to the end of the heap, in the unsorted area 303, in O(1) time. This time may be somewhat increased in the infrequent event that the queue needs to be resized, e.g., when capacity 317 would be exceeded by adding an element to the queue. The handler then sets the ready flag for the cpq_operation, which enables the thread that was waiting on that push operation to continue with other work.

When the operation is a pop, two checks are made by the handler. First, if the heap 301 is empty and there are no recently pushed items in the unsorted section 303, then the ready flag is set to allow the thread to continue. In some implementations, the ready flag returns a value indicating the success or failure of the pop operation to obtain an item from the priority queue. Second, if the unsorted area 303 is not empty and the last added item (in the unsorted area 303) has a higher priority than the top of the heap 311 (assuming the heap 301 is not empty), then that item is returned, and the ready flag is set for the waiting thread. If the heap 301 is empty, the last added item is returned. This immediate pop of an item in the unsorted list 303 obviates the need for that item to ever be sorted into the heap 301, thereby saving the operational costs of sorting the heap. It should be noted that there may be other operation nodes in the unsorted list that have a higher priority than the last one pushed, even though that last item had a higher priority than the top node 311. However, all items in the unsorted list are assumed to be concurrent, as they all arrived “simultaneously” from different threads, and are to be handled by a single thread. Thus, it is important only that the priority be greater than the top node, and not greater than the other unsorted nodes. If neither of the above cases holds, then the cpq_operation is set aside on another list, the pop_list, as shown in FIG. 9, for later dequeuing from the priority queue. The pop_list 900 is merely a list, typically a linked list, for holding those pop operations that were not able to be immediately handled in constant time. Since there are at most P aggregated operations, where P is the number of threads, this first pass through the op_list takes O(P) time. The active handler next re-examines each cpq_operation in the pop_list in turn, making the same checks as described above.
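
The first pass might look like the following sketch, continuing the assumed declarations above; heap is the array of FIG. 3, with elements before mark heap-sorted and elements from mark onward forming the unsorted area (a larger int value is assumed to mean higher priority):

```cpp
#include <vector>

std::vector<int> heap;                 // [0, mark) is heap-ordered; the rest is unsorted
std::size_t mark = 0;                  // boundary between sorted heap and unsorted items
std::vector<cpq_operation*> pop_list;  // pops deferred beyond constant time

// First pass over op_list: every push is O(1); pops are served in O(1) when possible.
void first_pass(cpq_operation* op_list) {
    for (cpq_operation* op = op_list; op != nullptr; op = op->next) {
        if (op->type == cpq_operation::PUSH) {
            heap.push_back(op->value);       // append to the unsorted area
            op->ready.store(true);           // release the waiting thread
        } else if (heap.empty()) {
            op->success = false;             // heap and unsorted area both empty
            op->ready.store(true);
        } else if (heap.size() > mark &&                   // unsorted area non-empty, and
                   (mark == 0 || heap.back() > heap[0])) { // last item beats (or no) heap top
            op->value = heap.back();         // serve the pop from the unsorted area
            heap.pop_back();
            op->success = true;
            op->ready.store(true);
        } else {
            pop_list.push_back(op);          // defer: needs O(log n) heap work
        }
    }
}
```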

If the heap and the unsorted items section are both empty, the ready flag is set, so that the thread waiting on the pop/dequeue operation may resume normal execution. If the last added element (in the unsorted section) has higher priority than the top of the non-empty heap, or the heap is empty, that element is returned and the ready flag is set for the waiting thread. If neither case holds, the top of the heap, as the highest priority item, is returned, the ready flag is set, and one of the following is placed at the top of the heap and pushed down into the heap until it is in the proper place: (a) the last added (unsorted) item; or (b) the last item in the heap, if there are no recently added unsorted items.

Pushing a new item into the heap takes O(log n) time. If a popped element can be replaced with a recently added item, then the push operation is virtually free: moving the unsorted item into the heap requires a single O(log n) operation, whereas had it been inserted into the heap initially and then the heap re-sorted after the pop, two O(log n) operations would have been required.
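
A sketch of serving one deferred pop under the same assumptions (the constant-time checks above are assumed to have already failed for this operation, so the heap region is non-empty):

```cpp
#include <algorithm>

// Sift the element at index 0 down within the heap-ordered region: O(log n).
void push_down_from_top() {
    std::size_t i = 0;
    for (;;) {
        std::size_t l = 2 * i + 1, r = 2 * i + 2, largest = i;
        if (l < mark && heap[l] > heap[largest]) largest = l;
        if (r < mark && heap[r] > heap[largest]) largest = r;
        if (largest == i) break;
        std::swap(heap[i], heap[largest]);
        i = largest;
    }
}

// Serve one deferred pop from pop_list: return the heap top, then refill the
// hole with the last unsorted item if any (making that push effectively free),
// otherwise with the last heap item.
void serve_deferred_pop(cpq_operation* op) {
    op->value = heap[0];
    op->success = true;
    heap[0] = heap.back();
    heap.pop_back();
    if (mark > heap.size()) mark = heap.size();  // the heap region itself shrank
    push_down_from_top();
    op->ready.store(true);
}
```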

When all operations from op_list and pop_list are completed, either enqueued to or dequeued from the priority queue, and all waiting threads have been released except for the handler, the handler will check for any remaining unsorted pushed items 303 in the priority queue. If there exist unsorted items, then the handler launches a heapify operation to merge those items into the heap and sort the queue appropriately. By the time the handler thread has finished operating on the op_list, pop_list and priority queue, all unsorted (concurrent) items will have either been sorted into the priority queue heap or popped from the queue.
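
The final heapify step might be sketched as follows; this simple per-item sift-up is O(log n) per item in the worst case, so the O(k + log n) bound stated below should be read as a property of a more careful merge than this illustrative version:

```cpp
// Merge any items still in the unsorted area into the heap before the
// handler releases control, so the next op_list starts from a sorted heap.
void heapify_unsorted() {
    while (mark < heap.size()) {
        std::size_t i = mark++;                     // adopt the next unsorted item
        while (i > 0 && heap[i] > heap[(i - 1) / 2]) {
            std::swap(heap[i], heap[(i - 1) / 2]);  // sift up past smaller parents
            i = (i - 1) / 2;
        }
    }
}
```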

The handler_busy flag is then reset to allow the waiting handler to begin operations on the pending_operations list (which becomes the new op_list). Up to P operations are handled by one handler thread at the same time. Only the active handler thread accesses the heap data structure and associated variables of the priority queue. Thus, better cache behavior will result since multiple threads are not accessing memory at the same time.

Referring to FIG. 10, a flow chart illustrating both the synchronization and aggregation and the enqueue/dequeue optimization is shown, according to an embodiment of the invention. In an embodiment of the synchronization and aggregation component, threads record an enqueue (push) or dequeue (pop) operation onto a list, rather than perform the operation on the actual concurrent priority queue, starting in block 1001. The threads store information about that operation in a local stack variable op_info (1003), and then prepend a pointer to that information to the pending_operations list via the atomic compare_and_swap operation (1005) on the head of the list, as discussed above. In the process of adding this information, the thread will either be the first thread to add information to the list, or the list will already have information about operations from other threads. If the thread is not the first, as determined in block 1007, it simply spin-waits on a ready flag (1009), a field of the local stack variable op_info, until the operation is complete. However, if the thread is the first to add to the pending_operations list, it becomes the handler of the operations for that list, and waits for the handler_busy flag to be reset (1015) before acting on the operations in the pending_operations list. Other threads may continue to add operations to the pending_operations list until the waiting handler thread is able to begin operating on the list, when the handler_busy flag becomes FALSE. When a non-handler thread's ready flag is set, as determined in 1009, it knows that its operation has been executed on the priority queue and can complete any additional work required by the operation at block 1011 before returning from the operation at 1013.
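
Putting the thread-side protocol together (blocks 1001-1019 and 1031), a hypothetical pop entry point might look like this, reusing the sketches above; handle_op_list is a stand-in for the first pass, the pop_list pass, and the heapify described here:

```cpp
void handle_op_list(cpq_operation* ops);  // assumed: first_pass, pop_list pass, heapify

// Thread-side dequeue: record the operation, then either spin-wait on the
// local ready flag or become the handler and process the whole batch.
int concurrent_pop(bool* ok) {
    cpq_operation op_info;                 // local stack variable (block 1003)
    op_info.type = cpq_operation::POP;
    if (!add_to_pending(&op_info)) {       // CAS prepend (block 1005)
        while (!op_info.ready.load()) { }  // spin-wait on ready flag (block 1009)
    } else {                               // first on the list => handler (block 1007)
        cpq_operation* ops = become_active_handler();  // blocks 1015-1019
        handle_op_list(ops);               // blocks 1021-1029; sets the ready flags
        handler_busy.store(false);         // hand off to the waiting handler (1031)
    }
    *ok = op_info.success;
    return op_info.value;
}
```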

The handler's responsibility is to process the operations in the pending_operations list (now pointed to by op_list) and enqueue or dequeue the items to/from the concurrent priority queue. To do this, the handler first ensures that it is the only active handler by spin-waiting on a non-atomic handler_busy flag (1015). This flag need not be atomic, because at most one thread will be actively handling operations, and at most one thread will be trying to become the handler at a time. When the handler_busy flag becomes unset, or changed to FALSE, the next waiting handler sets the flag (TRUE) in block 1017 and becomes the active handler. The active handler may then atomically obtain the pending_operations list via a fetch_and_store operation, in block 1019, leaving behind an empty pending_operations list to which threads with subsequent operations may append. It should be noted that the thread that adds the first item to that empty list will be the next thread to wait for the handler_busy flag to become unset, and will become the next active handler.

Meanwhile, the active handler proceeds to handle the operations in the pending_operations list, which is now pointed to by op_list. The loop beginning at block 1021 handles each operation in the op_list until there are no more items in the list. Each item associated with an enqueue operation is placed at the end of the priority queue, into an unsorted list, in block 1023, which takes constant time, O(1). If the operation is a dequeue (pop) operation, then the operation is added to the pop_list for later processing, unless it can be performed in constant time.

The underlying representation of the priority queue is a heap stored in an array, as discussed above. On their own, the individual enqueue and dequeue operations would each take O(log n) time to execute, where n is the number of elements already in the heap. In an embodiment, the aggregated operations are re-ordered so that enqueue operations are performed first and dequeue operations second. The list is processed one operation at a time. Each enqueue operation to the priority queue is handled in constant time by placing the new element at the very end of the queue and setting the ready flag on the waiting thread so that it can resume execution on other work. Each dequeue operation is examined to see if it can be handled in constant time: if the priority queue is empty, the operation may return an unsuccessful status; or, if there are newly added elements that have not been inserted into the heap yet, the last entered element may be examined to see if it has higher priority than the current top of the heap, and if so, that element is returned (in effect, the item is served from the top without ever being inserted at the bottom). For these cases, the ready flag is set so that the waiting thread can continue. In most cases, neither of these situations occurs, and the dequeue operation is set aside in a separate pop_list for later processing.

Once all operations in op_list have been examined, and either handled in constant time or moved to pop_list, the handler parses through the remaining dequeue operations in the pop_list, in blocks 1025 and 1027. For each dequeue operation, the current top of the heap is compared with the last element in the priority queue's heap array (the unsorted items). If the last element has higher priority (because it may be a newly enqueued element that has not been pushed into the sorted heap yet), it is simply returned as the result of the dequeue operation, and the ready flag is set. If the top of the heap has higher priority, the top element is extracted and returned, and the ready flag is set. The last element in the unsorted portion of the list is then placed at the top of the heap in place of the item just removed, and then pushed down the heap until it is in the proper position, using a typical O(log n) heap insert operation.

Before the handler can safely give control over to the next handler, it needs to process any remaining enqueued elements that were not pushed into the heap during the processing of the pop_list. These elements are pushed up the heap from their current positions at the bottom, in block 1029. This process is also known as “heapifying,” which is performed in O(k + log n) time, where k is the number of remaining enqueued, but unsorted, elements.

When the handler is finished enqueuing or dequeuing all of the operations from the original op_list, and has heapified any remaining unsorted items, it gives control over to the waiting handler by unsetting the handler_busy flag, in block 1031, and the cycle repeats.

In an embodiment, the enqueue/dequeue optimization component defines how a handler processes the op_list. Processing of the op_list elapses O(P) time, where P is the number of threads. Heap-sorting of the priority queue is not performed until the dequeue operations (non-constant time) have been processed from the pop_list. Because threads wait for their operations before continuing processing, there can never be more than one operation per thread in any pending_operations list, or op_list. Further, because each thread is represented in the list by at most one operation, and the list represents (nearly) simultaneous operations, the operations may be handled in any order while still preserving the linearizability of the algorithm.

The synchronization mechanism of the first component provides a linearization point for all the operations in a particular pending operations list: the time at which the handler thread completes the fetch_and_store operation that captures the pending_operations list. It should be noted that any two operations in different pending_operations lists will have different linearization points; thus, the operation with the later linearization point will operate using the effects of the earlier operations on the priority queue. In other words, two operations i and j with linearization points t_i and t_j have the following property: if t_i < t_j, then operation j sees the effect of operation i on the priority queue. All operations in the same list happen in a serial order determined by an algorithm described herein, while preserving serializability, but since they have the same linearization point, they effectively happen simultaneously.

FIGS. 4-8 show an example of the enqueue and dequeue operations performed by the handler on the priority queue, according to an embodiment of the invention. Referring now to FIG. 4, there is shown an op_list 110 a in its start state, before the handler commences with enqueues and dequeues to/from the priority queue 400. The op_list was taken from the fetched pending_operations list 110. At this point, the heapified portion of the priority queue has five items (36, 22, 32, 14 and 5), and the unsorted portion 403 is empty. The op_list 110 a illustrates that thread T1 101 is to push item 42 onto the queue, T3 103 is to push item 27 onto the queue, and T2 102 is to pop an item (the item with the highest priority).

FIG. 5 illustrates the enqueue operation of T1 101. The handler performs the push (42) operation of T1 and adds item 42 to the unsorted portion 403 a of the priority queue 400 a.

FIG. 6 illustrates the handler performing the pop operation of T2, after the item of T3 has also been pushed into the unsorted portion 403 b (because, as discussed above, all push operations and constant-time pop operations are handled before non-constant-time pop operations). In this example, the pop operation of T2 cannot be performed in constant time, so it was added to the pop_list 601 for later dequeuing. As discussed above, the handler checks to ensure that the heap 401 is not empty before performing the pop/dequeue. In this example, the last item, 27, in the unsorted portion of the priority queue is of a lower priority than the item at the top of the queue (36). Therefore, the handler proceeds to dequeue the top item of the heap, as shown in FIG. 7. Item 36 is removed from the heap 401 and the handler sets the ready flag so that T2 may proceed. The last pushed item, 27, is then placed where 36 was removed, and then sorted into the appropriate location based on relative priority, as shown in 401 a. The unsorted list 403 c now holds only item 42.

After all items from the original op_list are enqueued and dequeued to/from the priority queue by the handler, the remaining unsorted items in the priority queue are heapified, as illustrated in FIG. 8. Item 42 in the unsorted list is inserted into the heap, and the entire heap 401 b is heap-sorted by priority.

FIG. 11 shows throughput for embodiments of the invention, shown as “optimized,” versus a spin-lock-wrapped STL (C++ Standard Template Library) priority queue for a simple benchmark under high, moderate and low contention, for both single-thread-per-core and hyperthreading on an 8-core Nehalem, available from Intel Corp. 100% contention corresponds to zero busy-wait time; each percent less adds approximately 0.025 μs to the busy-wait time, so, for example, 60% corresponds to 1 μs.

FIG. 12 is a block diagram of an example computing system 1200 on which embodiments of the invention may be implemented. It will be understood that a variety of computer architectures may be utilized without departing from the spirit and scope of embodiments described herein. System 1200 depicts a point-to-point system with one or more processors. The claimed subject matter may comprise several embodiments, for instance, one with one processor 1201, or a system with multiple processors and/or multiple cores (not shown). In an embodiment, each processor may be directly coupled to a memory 1210 and connected to each other processor via a network fabric, which may comprise any or all of: a link layer, a protocol layer, a routing layer, a transport layer, and a physical layer. The fabric facilitates transporting messages from one protocol (home or caching agent) to another protocol for a point-to-point network.

In another embodiment, the memory 1210 may be connected to the processor 1201 via a memory control device. The processor 1201 may be coupled to a graphics and memory control 1203, depicted as IO+M+F, via a network fabric link that corresponds to a layered protocol scheme. The graphics and memory control is coupled to memory 1210 and may be capable of receiving and transmitting via peripheral component interconnect (PCI) Express links. Likewise, the graphics and memory control 1203 is coupled to the input/output controller hub (ICH) 1205. Furthermore, the ICH 1205 is coupled to a firmware hub (FWH) 1207 via a low pin count (LPC) bus. Also, for a different processor embodiment, the processor may have external network fabric links. The processor may have multiple cores with split or shared caches, with each core coupled to an X-bar router and a non-routing global links interface. An X-bar router is a point-to-point (pTp) interconnect between cores in a socket. X-bar is a “cross-bar,” meaning that every element has a cross-link or connection to every other. This is typically faster than a pTp interconnect link and is implemented on-die, promoting parallel communication. Thus, the external network fabric links are coupled to the X-bar router and a non-routing global links interface.

An embodiment of a multi-processor system (not shown) may comprise a plurality of processing nodes interconnected by a point-to-point network. For purposes of this discussion, the terms “processing node” and “compute node” are used interchangeably. Links between processors are typically full, or maximum, width, and links from processors to an IO hub (IOH) chipset (CS) are typically half width. Each processing node may include one or more central processors coupled to an associated memory which constitutes main memory of the system. In alternative embodiments, memory may be physically combined to form a main memory that is accessible by all of the processing nodes. Each processing node may also include a memory controller to interface with memory. Each processing node, including its associated memory controller, may be implemented on the same chip. In alternative embodiments, each memory controller may be implemented on a chip separate from its associated processing node.

Each memory 1210 may comprise one or more types of memory devices such as, for example, dual in-line memory modules (DIMMs), dynamic random access memory (DRAM) devices, synchronous dynamic random access memory (SDRAM) devices, double data rate (DDR) SDRAM devices, or other volatile or non-volatile memory devices suitable for server or general applications.

The system may also include one or more input/output (I/O) controllers 1205 to provide an interface for the processing nodes and other components of the system to access I/O devices, for instance a flash memory or firmware hub (FWH) 1207. In an embodiment, each I/O controller 1205 may be coupled to one or more processing nodes. The links between I/O controllers 1205 and their respective processing nodes 1201 are referred to as I/O links. I/O devices may include Industry Standard Architecture (ISA) devices, Peripheral Component Interconnect (PCI) devices, PCI Express devices, Universal Serial Bus (USB) devices, Small Computer System Interface (SCSI) devices, or other standard or proprietary I/O devices suitable for server or general applications. I/O devices may be wire-lined or wireless. In one embodiment, I/O devices may include a wireless transmitter and a wireless receiver.

The system may be a server, a multi-processor desktop computing device, an embedded system, a network device, or a distributed computing device where the processing nodes are remotely connected via a wide-area network.

In an embodiment, an operating system (OS) 1211 resides in memory 1210 for executing on the processor 1201. It will be understood that the system architecture may include virtual machines or embedded execution partitions running separate operating systems and/or separate cores or processors in a multi-core or multi-processor system. In either case, the OS 1211 operates in conjunction with multi-threaded applications 1213. In an embodiment, a multi-threaded application 1213 requires the use of a priority queue 1220 for efficient operation. The multi-threaded application has a synchronization and aggregation module 1215, as described above, for scheduling thread-based operation nodes (items to be pushed or popped from the priority queue 1220) onto a pending operations list. Once a thread is designated as the active handler for the pending operations list, as described above, the active handler utilizes an enqueue/dequeue optimization module 1217 to optimize enqueue and dequeue operations on the priority queue 1220, according to the items in the pending operations list.

The techniques described herein are not limited to any particular hardware or software configuration; they may find applicability in any computing, consumer electronics, or processing environment. The techniques may be implemented in hardware, software, or a combination of the two.

For simulations, program code may represent hardware using a hardware description language or another functional description language which essentially provides a model of how designed hardware is expected to perform. Program code may be assembly or machine language, or data that may be compiled and/or interpreted. Furthermore, it is common in the art to speak of software, in one form or another, as taking an action or causing a result. Such expressions are merely a shorthand way of stating execution of program code by a processing system which causes a processor to perform an action or produce a result.

Each program may be implemented in a high level procedural or object-oriented programming language to communicate with a processing system. However, programs may be implemented in assembly or machine language, if desired. In any case, the language may be compiled or interpreted.

Program instructions may be used to cause a general-purpose or special-purpose processing system that is programmed with the instructions to perform the operations described herein. Alternatively, the operations may be performed by specific hardware components that contain hardwired logic for performing the operations, or by any combination of programmed computer components and custom hardware components. The methods described herein may be provided as a computer program product that may include a machine accessible medium having stored thereon instructions that may be used to program a processing system or other electronic device to perform the methods.

Program code, or instructions, may be stored in, for example, volatile and/or non-volatile memory, such as storage devices and/or an associated machine readable or machine accessible medium, including solid-state memory, hard-drives, floppy-disks, optical storage, tapes, flash memory, memory sticks, digital video disks, digital versatile discs (DVDs), etc., as well as more exotic mediums such as machine-accessible biological state preserving storage. A machine readable medium may include any mechanism for storing, transmitting, or receiving information in a form readable by a machine, and the medium may include a tangible medium through which electrical, optical, acoustical or other form of propagated signals or carrier wave encoding the program code may pass, such as antennas, optical fibers, communications interfaces, etc. Program code may be transmitted in the form of packets, serial data, parallel data, propagated signals, etc., and may be used in a compressed or encrypted format.

Program code may be implemented in programs executing on programmable machines such as mobile or stationary computers, personal digital assistants, set top boxes, cellular telephones and pagers, consumer electronics devices (including DVD players, personal video recorders, personal video players, satellite receivers, stereo receivers, and cable TV receivers), embedded processors such as those coupled to an automobile, and other electronic devices, each including a processor, volatile and/or non-volatile memory readable by the processor, at least one input device, and/or one or more output devices. Program code may be applied to the data entered using the input device to perform the described embodiments and to generate output information. The output information may be applied to one or more output devices. One of ordinary skill in the art may appreciate that embodiments of the disclosed subject matter can be practiced with various computer system configurations, including multiprocessor or multiple-core processor systems, minicomputers, mainframe computers, as well as pervasive or miniature computers or processors that may be embedded into virtually any device. Embodiments of the disclosed subject matter can also be practiced in distributed computing environments where tasks or portions thereof may be performed by remote processing devices that are linked through a communications network.

Although operations may be described as a sequential process, some of the operations may in fact be performed in parallel, concurrently, and/or in a distributed environment, and with program code stored locally and/or remotely for access by single or multi-processor machines. In addition, in some embodiments the order of operations may be rearranged without departing from the spirit of the disclosed subject matter. Program code may be used by or in conjunction with embedded controllers.

While this invention has been described with reference to illustrative embodiments, this description is not intended to be construed in a limiting sense. Various modifications of the illustrative embodiments, as well as other embodiments of the invention, which are apparent to persons skilled in the art to which the invention pertains, are deemed to lie within the spirit and scope of the invention.

1. A computer implemented system, comprising: a processor coupled to memory, the processor configured to execute multi-threaded application programs; a multi-threaded application executing on the processor, the multi-threaded application program configured to utilize a concurrent entity, and wherein a plurality of threads of the multi-threaded application are configured to generate a plurality of operation nodes to operate upon the concurrent entity concurrently with other threads; a synchronization and aggregation logic component coupled to the multi-threaded application and configured to accept operation nodes from the plurality of threads, each operation node corresponding to a single thread, the accepted operation nodes to be placed in a temporary list, the operation nodes defining an operation to perform on the concurrent entity, wherein only one thread, known as a handler thread, is permitted to operate on the temporary list to perform the operations on the concurrent entity, and wherein each thread is permitted to provide only one operation node to the temporary list and waits until the corresponding operation node has been processed by the handler thread before being permitted to provide another operation node to a second temporary list; and the concurrent entity stored in the memory accessible to the multi-threaded application program, wherein the concurrent entity comprises a concurrent priority queue configured to accept enqueue and dequeue operations, and wherein each operation node comprises one of either an enqueue or dequeue operation.
 2. The system as recited in claim 1, wherein the handler thread is configured to poll a handler_busy flag to determine whether operation on the temporary list is to commence, and wherein the handler thread for the temporary list is a thread having added an operation node to the temporary list before any other thread.
 3. The system as recited in claim 1, wherein a thread is configured to poll a ready flag after placing an operation node on the temporary list, the ready flag being an indicator of whether the thread is permitted to continue operation, and wherein the handler thread is configured to reset the ready flag corresponding to the thread after processing the operation node from the temporary list.
 4. The system as recited in claim 1, wherein the handler thread is configured to retrieve the temporary list into a new pointer object, and to replace the original temporary list with an empty list, both via an atomic operation.
 5. The system as recited in claim 1, wherein the threads are configured to place an operation node in the temporary list via an atomic operation.
 6. The system as recited in claim 1, further comprising: an enqueue/dequeue optimization logic component coupled to the multi-threaded application and having access to the concurrent priority queue, wherein the handler thread is configured to operate on the temporary list to process both enqueue operation nodes as they apply to the concurrent priority queue as well as constant-time dequeue operations, prior to operating on non-constant-time dequeue operations, before re-sorting items in the concurrent priority queue, and wherein enqueue operations are performed before all non-constant time dequeue operations, and wherein the handler thread is further configured to provide a fully heap-sorted concurrent priority queue to the multi-threaded application after performing all operation nodes and re-sorting the concurrent priority queue.
 7. The system as recited in claim 6, wherein the concurrent priority queue comprises an array data structure stored in the memory, where, when non-empty, at least a portion of the array elements are sorted into a heap.
 8. The system as recited in claim 7, wherein the handler thread is configured to process enqueue and dequeue operation nodes from the temporary list one at a time, wherein enqueue operation nodes are added to an unsorted portion of the concurrent priority queue until a dequeue operation node is processed, and wherein when a last item added to the unsorted portion of the priority queue has a higher priority than a highest priority item in the heap, the thread is configured to perform a dequeue operation by returning the last enqueued item added to the unsorted portion, and when all enqueue operation items have been added to the unsorted portion of the concurrent priority queue, then performing any remaining dequeue operation nodes and re-sort the heap portion of the concurrent priority queue as operation nodes are removed from the heap.
 9. A computer implemented system, comprising: a processor coupled to memory, the processor configured to execute multi-threaded application programs; a multi-threaded application executing on the processor, the multi-threaded application program configured to utilize a concurrent priority queue stored in the memory and configured to accept enqueue and dequeue operations, and wherein a plurality of threads of the multi-threaded application are configured to generate a plurality of operation nodes to operate upon the concurrent priority queue concurrently with other threads, and wherein each operation node comprises one of either an enqueue or dequeue operation; an enqueue/dequeue optimization logic component coupled to the multi-threaded application and having access to the concurrent priority queue, wherein a handler thread is configured to operate on the temporary list to process both enqueue operation nodes as they apply to the concurrent priority queue as well as constant-time dequeue operations, prior to operating on non-constant-time dequeue operations, before re-sorting items in the concurrent priority queue, and wherein enqueue operations are performed before all non-constant time dequeue operations, and wherein the handler thread is further configured to provide a fully sorted concurrent priority queue to the multi-threaded application after performing all operation nodes and re-sorting the concurrent priority queue.
 10. The system as recited in claim 9, wherein the concurrent priority queue comprises an array data structure stored in the memory, where, when non-null, at least a portion of the array elements are sorted into a heap.
 11. The system as recited in claim 10, wherein the handler thread is configured to process enqueue and dequeue operation nodes from the temporary list one at a time, wherein enqueue operation nodes are added to an unsorted portion of the concurrent priority queue until a dequeue operation node is processed, and wherein when a last item added to the unsorted portion of the priority queue has a higher priority than a highest priority item in the heap, the thread is configured to perform a dequeue operation by returning the last enqueued item added to the unsorted portion, and when all enqueue operation items have been added to the unsorted portion of the concurrent priority queue, then performing any remaining dequeue operation nodes and re-sort the heap portion of the concurrent priority queue as operation nodes are removed from the heap.
 12. The system as recited in claim 9, wherein the handler thread is further configured to reset a handler_busy flag after completing processing of the temporary list, the handler_busy flag to indicate that a waiting thread may commence processing on a second temporary list comprising operation nodes.
 13. The system as recited in claim 9, further comprising: a synchronization and aggregation logic component coupled to the multi-threaded application and configured to accept operation nodes from the plurality of threads, each operation node corresponding to a single thread, the accepted operation nodes to be placed in a temporary list, the operation nodes defining an operation to perform on the concurrent priority queue, wherein only one thread, known as a handler thread, is permitted to operate on the temporary list to perform the operations on the concurrent priority queue, and wherein each thread is permitted to provide only one operation node to the temporary list and waits until the corresponding operation node has been processed by the handler thread before being permitted to provide another operation node to a second temporary list; and the temporary list stored in the memory accessible to the enqueue/dequeue optimization logic component for processing with the concurrent priority queue.
 14. The system as recited in claim 13, wherein the handler thread is configured to poll a handler_busy flag to determine whether operation on the temporary list is to commence, and wherein the handler thread for the temporary list is a thread having added an operation node to the temporary list before any other thread.
 15. The system as recited in claim 13, wherein a thread is configured to poll a ready flag after placing an operation node on the temporary list, the ready flag being an indicator of whether the thread is permitted to continue operation, and wherein the handler thread is configured to reset the ready flag corresponding to the thread after processing the operation node from the temporary list.
 16. The system as recited in claim 13, wherein the handler thread is configured to retrieve the temporary list into a new pointer, and to leave behind an initially empty list comprising the second temporary list, both via an atomic operation.
 17. The system as recited in claim 13, wherein the threads are configured to place an operation node in the temporary list via an atomic operation.
 18. A non-transitory machine accessible medium having instructions stored thereon, the instructions when executed on a machine, cause the machine to: add operation nodes to a temporary list, by at least one thread of a plurality of threads executing in a multi-threaded application running on the machine; assign a first thread that added an operation node to the temporary list a role as handler thread for the temporary list; wait by the handler thread for a flag indicating processing of the temporary list is permitted to commence; retrieve the temporary list and generate an initially empty second temporary list, by the handler thread, in an atomic operation, when the flag indicates processing of the temporary list is permitted to commence, wherein the second temporary list is to receive operation nodes by at least one thread when the at least one thread does not have an unprocessed operation node on the retrieved temporary list; and process the temporary list by the handler thread into a concurrent entity, wherein the concurrent entity comprises a concurrent priority queue, and an operation node comprises either one of a dequeue or enqueue operation to be performed on the concurrent priority queue.
 19. The medium as recited in claim 18, further comprising instructions to: assign a second handler thread associated with the second temporary list upon the second handler thread having added a first operation node to the second temporary list; and wait by the second handler thread until the temporary list has been processed by the handler thread before processing the second temporary list by the second handler thread.
 20. The medium as recited in claim 18, wherein operation nodes are added to the temporary list by threads, via an atomic operation.
 21. The medium as recited in claim 18, wherein the concurrent priority queue comprises an array data structure stored in the memory, where, when non-empty, at least a portion of the array elements are sorted into a heap.
 22. The medium as recited in claim 21, further comprising instructions to: process operation nodes in the temporary list by the handler thread, wherein enqueue operation nodes in the temporary list are added to the concurrent priority queue in constant time, and non-constant time dequeue operations in the temporary list are delayed until all constant-time enqueue and dequeue operation nodes in the temporary list have been processed by the handler thread; and re-sort the concurrent priority queue after the temporary list has been processed by the handler thread, when necessary.
 23. The medium as recited in claim 22, further comprising instructions to: add enqueue operation items to an unsorted portion of the concurrent priority queue until a dequeue operation node is processed, and wherein when a last item added to the unsorted portion of the priority queue has a higher priority than a highest priority item in the heap, then perform a dequeue operation on the last enqueued item added to the unsorted portion; and when all enqueue operation items have been added to the unsorted portion of the concurrent priority queue, then perform any remaining dequeue operation nodes and re-sort the heap portion of the concurrent priority queue as operation nodes are removed from the heap.
 24. A non-transitory machine accessible medium having instructions stored thereon, the instructions when executed on a machine, cause the machine to: process enqueue and dequeue operation nodes from a temporary list, the enqueue and dequeue operation nodes corresponding to operations to be performed on a concurrent priority queue associated with a multi-threaded application executing on the machine, the processing to be performed by a thread assigned to be a handler thread for the temporary list, and wherein the concurrent priority queue comprises an array data structure stored in memory, where, when non-empty, at least a portion of the array elements are sorted into a heap; process operation nodes in the temporary list by the handler thread, wherein enqueue operation nodes in the temporary list are added to the concurrent priority queue in constant time, and non-constant time dequeue operations in the temporary list are delayed until all constant-time enqueue and dequeue operation nodes in the temporary list have been processed by the handler thread; and re-sort the concurrent priority queue after the temporary list has been processed by the handler thread, when necessary.
 25. The medium as recited in claim 24, further comprising instructions to: add enqueue operation nodes to an unsorted portion of the concurrent priority queue until a dequeue operation node is processed, and wherein when a last item added to the unsorted portion of the priority queue has a higher priority than a highest priority item in the heap, then perform a dequeue operation returning the last enqueued item added to the unsorted portion; and when all enqueue operation nodes have been added to the unsorted portion of the concurrent priority queue, then perform any remaining dequeue operation nodes and re-sort the heap portion of the concurrent priority queue as operation nodes are removed from the heap.
 26. The medium as recited in claim 24, further comprising instructions to generate the temporary list, comprising: assign a first thread that added an operation node to the temporary list a role as the handler thread for the temporary list; wait by the handler thread for a flag indicating processing of the temporary list is permitted to commence; and retrieve the temporary list and generate an initially empty second temporary list, by the handler thread, via an atomic operation, when the flag indicates processing of the temporary list is permitted to commence, wherein the second temporary list is to receive operation nodes by at least one thread when the at least one thread does not have an unprocessed operation node on the retrieved temporary list.
 27. The medium as recited in claim 26, further comprising instructions to: assign a second handler thread associated with the second temporary list upon the second handler thread having added a first operation node to the second temporary list; and wait by the second handler thread until the temporary list has been processed by the handler thread before processing the second temporary list by the second handler thread.
 28. The medium as recited in claim 26, wherein operation nodes are added to the temporary list by threads, via an atomic operation. 